Setup

Packages

We’ll be using pacman to ensure packages are installed:

if (!require("pacman")) install.packages("pacman", repos = "https://cloud.r-project.org")

pacman::p_load("tidyverse", "data.table", "skimr", "plotly", "archive", "scales")

set.seed(1)

Data Import

Three packages for import - readr for delimited rectangular files, archive for dealing with remote zips without needing to download (mostly), haven for pulling data that was saved from other systems that have their own specific files types (SAS, Stata, SPSS). Both will attempt to guess the types of the data in the file based on structure.

acd <- readr::read_csv(archive_read("https://ucdp.uu.se/downloads/ucdpprio/ucdp-prio-acd-231-csv.zip", file=1))
## Rows: 2626 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (14): location, side_a, side_a_id, side_a_2nd, side_b, side_b_id, side_...
## dbl  (10): conflict_id, incompatibility, year, intensity_level, cumulative_i...
## lgl   (1): ep_end_prec
## date  (3): start_date, start_date2, ep_end_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
acd

If we were instead using the Stata version (.dta) - we could get the same result with haven

acd_stata <-haven::read_dta(archive_read("https://ucdp.uu.se/downloads/ucdpprio/ucdp-prio-acd-231-dta.zip", file = 1))
acd_stata

Things aren’t perfect - you can see that type idetnification failed on all the dates for the Stata file - but not the CSV. Interestingly, this seems to be because the authors incorrectly stored the Stata variables as strings instead of dates so readr accepted the Stata formatting rather than use its best guess. You can see when given an unformatted string with the CSV file readr correctly interpreted those as dates.

Cleaning

Mutate and select

Let’s find all of the date fields and convert them from strings to date:

acd_stata %>%
  mutate(across(contains("date"), lubridate::ymd))

Some useful things here. First, mutate() is the syntax for creating columns from existing columns. across() is the syntax for columnwise operations - I would like to apply the to be named function “across” to be selected columns. For example, summarize(across(a:c, mean)) would return the mean of columns named a and c, as well as any columns in between. However, we need a more complicated selection - we want all the date columns. We could simply name them - i.e.,

acd_stata %>%
  mutate(across(c(start_date, start_date2, ep_end_date), lubridate::ymd)) %>%
  dplyr::select(contains("date"))

(I’ve added the final selector line to save having to look for the correct columns). But this is more tedious and error prone. If column names follow a known pattern - all date columns contain the word date, for example, we can use a contains(), which checks whether the name of the variable contains a provided substring, like date.

The final parameter is the format of the date. If you’ve used base R for dates before, then you might recognize something like this:

acd_stata %>%
  mutate(across(c(start_date, start_date2, ep_end_date), function(x) as.Date(x, format="%Y-%m-%d"))) %>%
  dplyr::select(contains("date"))

This is using R’s base as.Date() function. This is unnecessarily complicated as you have to specify using symbolic notation how your dates are formatted. Here, we have %Y for four digit year, followed by %m for numeric month and finally %d for numeric day. All of these are separated by hyphens, because that’s how our dates are written. Because we’re mixing tidyverse and base, we have to specify a function call inside the mutate - i.e. I’m defining a function that takes an input x, and applies the base R conversion.

Base date conversions fail if you don’t specify things correctly:

acd_stata %>%
  mutate(across(c(start_date, start_date2, ep_end_date), function(x) as.Date(x, format="%y-%m-%d"))) %>%
  dplyr::select(contains("date"))

What’s different here? I used %y which is the notation for a two digit year. Our dates to not match the specified format so conversion fails. The lubridate package, which is part of tidyverse, has better functions for date conversion. Specifically ymd() works for any dates that are any year/month/day format - even those without standard delimiters.

test_dates <- data.frame(date=c("1990-April-10", "90/04/10", "1990 4 10", "1990 April 10", "1990April10"))
test_dates %>%
  mutate(date = ymd(date))

Pipes

You’ll notice the pipe character (%>%) throughout the above. A pipe allows you to work on the intermediate results of a series of commands without storing those results in memory or using ugly nested commands. This is very helpful for long code blocks. As we’ll see later, you can also modify the data and then pipe it to a plotting functions without having to store the modified data.

For example, in the below code block there are three steps. First we call the dataframe we want to operate on, here acd_stata. Then we pipe that object to mutate. Notice if we were thinking like base R this would fail - I haven’t attached acd_stata but I’ve called variables from it without telling R where to look - e.g. acd_stata$start_date. But, since we’ve piped acd_stata to the mutate call, it’s understood that those variables are in acd_stata.

The third step takes the resulting dataframe, with the correctly parsed dates, and selects some columns from it. The intermediate result, with the correct dates but all columns is never stored anywhere. If we were to store the results of this block, it would be a tibble with three columns.

acd_stata %>%
  mutate(across(c(start_date, start_date2, ep_end_date), lubridate::ymd)) %>%
  dplyr::select(contains("date"))

The pipe I’m using is referred to as the ‘magrittr pipe’ because it’s imported by tidyverse from magrittr. Prior to R 4.1.0, R had no base pipe function so this is what we used. As of that version, R includes its own base pipe character (|>). These seem mostly the same although see here. If I were learning R today, I would stick with the base pipe but old habits plus backwards compatibility…

separate_wider, pivot_longer, and a little bit of stringr with regex

Back to data: additional issue - data stores certain things as comma separated strings within a single column. This type of storage is difficult for analysis.

acd %>%
  arrange(desc(nchar(side_b_id))) %>%
  dplyr::select(side_a, side_b, side_b_id)
acd_wide<-acd %>%
  arrange(desc(nchar(side_b_id))) %>%
  separate_wider_delim(c(side_a_id, side_b_id),
                       delim=stringr::regex("\\s*+,\\s*+"),
                       names_sep = "_", too_few = 'align_start')
acd_wide

A couple things are happening here. separate_wider_delim() splits text fields on a delimiter. First argument is columns we want to split. Could also use tidyselect functions here - e.g., everything(), contains(). Second argument is the delimiter - where to split the strings. These strings are split by a comma followed by a whitespace character, but sometimes there are typos like multiple whitespaces. We can make the argument general with regex - which matches on any number of white spaces on either side of a comma. If we were confident the delimiters were always the same we could just write delim = “,” ( is a placeholder whitespace, in reality it would be a space but markdown won’t render it) - but this is more robust to errors.

Wide data like this is not ‘tidy’ - each of these columns with made of out the ids are actually different observations of the same variable - the actors. So we should pivot this data longer to make it tidier.

acd_long<-acd_wide %>%
  pivot_longer(contains(c("side_a_id")), names_to = NULL, values_to = "side_a_id") %>%
  relocate(side_a_id, .after=conflict_id) %>%
  pivot_longer(contains(c("side_b_id")), names_to = NULL, values_to = "side_b_id") %>%
  relocate(side_b_id, .after=side_a_id) %>%
  filter(!is.na(side_a_id) & !is.na(side_b_id))

acd_long

Pipes are helpful here because we need to do a few things. First, we need two pivots because we don’t want to make side a and and side b into one column. pivot_longer() takes the columns and makes them row observations of the same column. By default the column names will become a new column titled “name” which contains the name of the column that was pivoted. Value will hold the value of that column. We can overwrite these defaults. relocate() moves the created columns to new locations in the tibble - the default is first position, but .before or .after can be used to be more explicit.

We could (and should) have done this in one step, but I wanted you to see pivot.

acd %>%
  arrange(desc(nchar(side_b_id))) %>%
  separate_longer_delim(side_a_id, 
                        delim=stringr::regex("\\s*+,\\s*+")) %>%
  separate_longer_delim(side_b_id,
                        delim=stringr::regex("\\s*+,\\s*+")) %>%
  relocate(side_a_id, side_b_id, .after=conflict_id)

stringr

We used stringr briefly above, but stringr is tidyverse’s package for string data. stringr has a number of nice functions that all start with str_

Lookup

Here we can pull everything with the word Sudan in location

acd %>%
  filter(str_detect(location, "Sudan"))

Substitute

Or we could replace strings

acd %>%
  mutate(location = str_replace_all(location, "South Sudan", "S. Sudan")) %>%
  filter(str_detect(location, "Sudan"))

Descriptive Statistics

First we’re going to get some more data - have to do this in two steps as read_rds() does not work with the archive functions we were using before. To prevent unnecessary downloading, I’ve put the download in an if statement that checks the files in the project directory for the one we need. As a result, this should download the file once.

if (!"GEDEvent_v23_1.rds" %in% list.files()){
  archive_extract("https://ucdp.uu.se/downloads/ged/ged231-rds.zip", file=1)
}

ged<-read_rds("GEDEvent_v23_1.RDS")

Tidyverse has a number of useful ways to generate descriptive statistics. It’s worth noting that these usually transform the data so don’t overwrite the existing objects with the summaries. First we can start with glimpse() which does what you’d think:

glimpse(ged)
## Rows: 316,818
## Columns: 49
## $ id                <dbl> 244657, 412700, 413023, 412909, 132140, 130364, 1303…
## $ relid             <chr> "IRQ-2017-1-524-322", "IRQ-2021-1-524-145", "IRQ-202…
## $ year              <dbl> 2017, 2021, 2021, 2021, 1989, 1989, 1989, 1989, 1989…
## $ active_year       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ code_status       <chr> "Clear", "Clear", "Clear", "Clear", "Clear", "Clear"…
## $ type_of_violence  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ conflict_dset_id  <dbl> 259, 259, 259, 259, 333, 333, 333, 333, 333, 333, 33…
## $ conflict_new_id   <dbl> 259, 259, 259, 259, 333, 333, 333, 333, 333, 333, 33…
## $ conflict_name     <chr> "Iraq: Government", "Iraq: Government", "Iraq: Gover…
## $ dyad_dset_id      <dbl> 524, 524, 524, 524, 724, 724, 724, 724, 724, 724, 72…
## $ dyad_new_id       <dbl> 524, 524, 524, 524, 724, 724, 724, 724, 724, 724, 72…
## $ dyad_name         <chr> "Government of Iraq - IS", "Government of Iraq - IS"…
## $ side_a_dset_id    <dbl> 116, 116, 116, 116, 130, 130, 130, 130, 130, 130, 13…
## $ side_a_new_id     <dbl> 116, 116, 116, 116, 130, 130, 130, 130, 130, 130, 13…
## $ side_a            <chr> "Government of Iraq", "Government of Iraq", "Governm…
## $ side_b_dset_id    <dbl> 234, 234, 234, 234, 292, 292, 292, 292, 292, 292, 29…
## $ side_b_new_id     <dbl> 234, 234, 234, 234, 292, 292, 292, 292, 292, 292, 29…
## $ side_b            <chr> "IS", "IS", "IS", "IS", "Jam'iyyat-i Islami-yi Afgha…
## $ number_of_sources <dbl> 3, 15, 5, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ source_article    <chr> "\"Agence France Presse,2017-08-01,At least 20 kille…
## $ source_office     <chr> "Agence France Presse;Agence France Presse;Pajhwok N…
## $ source_date       <chr> "2017-08-01;2017-08-01;2017-07-31", "2021-08-26;2021…
## $ source_headline   <chr> "At least 20 killed in Shiite mosque attack in Afgha…
## $ source_original   <chr> "IS, interior ministry, security source", "US offici…
## $ where_prec        <dbl> 1, 1, 1, 1, 4, 4, 1, 1, 2, 4, 1, 1, 1, 1, 1, 4, 1, 6…
## $ where_coordinates <chr> "Kabul city", "Kabul international airport", "Jalala…
## $ where_description <chr> "Iraqi embassy in Kabul", "Kabul airport (Abbey gate…
## $ adm_1             <chr> "Kabul province", "Kabul province", "Nangarhar provi…
## $ adm_2             <chr> "Kabul district", "Kabul district", "Jalalabad distr…
## $ latitude          <dbl> 34.53109, 34.56444, 34.42884, 34.53109, 34.33333, 36…
## $ longitude         <dbl> 69.16280, 69.21722, 70.45575, 69.16280, 70.41667, 68…
## $ geom_wkt          <chr> "POINT (69.162796 34.531094)", "POINT (69.2172222 34…
## $ priogrid_gid      <dbl> 179779, 179779, 179061, 179779, 179061, 182658, 1804…
## $ country           <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghan…
## $ country_id        <dbl> 700, 700, 700, 700, 700, 700, 700, 700, 700, 700, 70…
## $ region            <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ event_clarity     <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ date_prec         <dbl> 1, 1, 1, 1, 3, 2, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ date_start        <dttm> 2017-07-31, 2021-08-26, 2021-08-28, 2021-08-29, 198…
## $ date_end          <dttm> 2017-07-31, 2021-08-26, 2021-08-28, 2021-08-29, 198…
## $ deaths_a          <dbl> 0, 13, 0, 0, 6, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ deaths_b          <dbl> 4, 1, 2, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 3, 10, 0,…
## $ deaths_civilians  <dbl> 0, 141, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 5, 7, 0, 0, 0…
## $ deaths_unknown    <dbl> 2, 28, 0, 0, 0, 4, 600, 2, 70, 0, 7, 0, 0, 0, 0, 0, …
## $ best              <dbl> 6, 183, 2, 10, 6, 4, 600, 2, 70, 20, 7, 1, 5, 7, 3, …
## $ high              <dbl> 6, 184, 3, 10, 6, 4, 600, 2, 70, 20, 7, 1, 5, 7, 3, …
## $ low               <dbl> 6, 171, 0, 9, 6, 0, 600, 0, 70, 0, 0, 1, 5, 7, 0, 10…
## $ gwnoa             <dbl> 645, 645, 645, 645, 700, 700, 700, 700, 700, 700, 70…
## $ gwnob             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

We can get summary statistics for the numeric variables

ged %>% 
  summarize(across(where(is.numeric), median))

We could build this out into a full table, but unnecessary as packages exist for this (technically not tidyverse, but integrated). This is skimr:

skim(ged)
Data summary
Name ged
Number of rows 316818
Number of columns 49
_______________________
Column type frequency:
character 18
numeric 29
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
relid 0 1.00 14 25 0 316818 0
code_status 0 1.00 5 5 0 1 0
conflict_name 0 1.00 8 243 0 1419 0
dyad_name 0 1.00 8 243 0 1649 0
side_a 0 1.00 2 190 0 904 0
side_b 0 1.00 2 238 0 901 0
source_article 0 1.00 6 29230 0 210712 0
source_office 100061 0.68 2 5909 0 16435 0
source_date 100061 0.68 10 3970 0 37619 0
source_headline 100061 0.68 1 18627 0 141590 0
source_original 41292 0.87 1 3791 0 46258 0
where_coordinates 0 1.00 3 109 0 46834 0
where_description 6767 0.98 1 14898 0 179413 0
adm_1 17839 0.94 5 43 0 1636 0
adm_2 66665 0.79 3 51 0 8310 0
geom_wkt 0 1.00 11 49 0 47957 0
country 0 1.00 4 31 0 124 0
region 0 1.00 4 11 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 242230.90 131299.68 4.00 133657.25 247543.00 355495.75 468640.00 ▆▆▇▇▇
year 0 1.00 2010.63 8.92 1989.00 2005.00 2013.00 2017.00 2022.00 ▂▂▂▇▇
active_year 0 1.00 0.97 0.18 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
type_of_violence 0 1.00 1.44 0.74 1.00 1.00 1.00 2.00 3.00 ▇▁▂▁▂
conflict_dset_id 0 1.00 2819.24 5131.13 6.00 299.00 333.00 432.00 17446.00 ▇▁▁▁▁
conflict_new_id 0 1.00 2826.01 5007.56 205.00 299.00 354.00 572.00 15859.00 ▇▁▁▁▂
dyad_dset_id 0 1.00 5896.41 6162.38 6.00 735.00 848.00 11973.00 17446.00 ▇▁▁▃▂
dyad_new_id 0 1.00 6105.96 6151.62 406.00 750.00 999.00 11973.00 17446.00 ▇▁▁▃▂
side_a_dset_id 0 1.00 497.90 1392.62 3.00 116.00 118.00 146.00 8586.00 ▇▁▁▁▁
side_a_new_id 0 1.00 497.90 1392.62 3.00 116.00 118.00 146.00 8586.00 ▇▁▁▁▁
side_b_dset_id 0 1.00 3294.96 3572.02 3.00 304.00 775.00 4456.00 9999.00 ▇▁▃▁▂
side_b_new_id 0 1.00 1759.15 2272.29 1.00 234.00 349.00 4456.00 8589.00 ▇▁▂▁▁
number_of_sources 0 1.00 0.68 1.64 -1.00 -1.00 1.00 1.00 361.00 ▇▁▁▁▁
where_prec 0 1.00 2.05 1.27 1.00 1.00 2.00 3.00 7.00 ▇▂▁▁▁
latitude 0 1.00 25.82 15.18 -37.81 13.93 33.21 35.49 68.98 ▁▂▃▇▁
longitude 0 1.00 34.46 46.25 -117.30 29.87 37.20 65.44 155.90 ▁▁▇▃▁
priogrid_gid 0 1.00 166879.43 21881.12 75530.00 149493.00 177553.00 180505.00 228667.00 ▁▂▃▇▁
country_id 0 1.00 578.05 195.13 2.00 500.00 652.00 700.00 940.00 ▁▁▂▇▂
event_clarity 0 1.00 1.07 0.26 1.00 1.00 1.00 1.00 2.00 ▇▁▁▁▁
date_prec 0 1.00 1.23 0.67 1.00 1.00 1.00 1.00 5.00 ▇▁▁▁▁
deaths_a 0 1.00 1.86 55.67 0.00 0.00 0.00 1.00 14162.00 ▇▁▁▁▁
deaths_b 0 1.00 2.31 31.67 0.00 0.00 0.00 1.00 9505.00 ▇▁▁▁▁
deaths_civilians 0 1.00 4.33 196.06 0.00 0.00 0.00 1.00 40000.00 ▇▁▁▁▁
deaths_unknown 0 1.00 2.11 176.64 0.00 0.00 0.00 0.00 75340.00 ▇▁▁▁▁
best 0 1.00 10.60 273.51 0.00 1.00 2.00 5.00 75340.00 ▇▁▁▁▁
high 0 1.00 16.13 407.78 0.00 1.00 2.00 5.00 74256.00 ▇▁▁▁▁
low 0 1.00 8.77 218.43 0.00 1.00 2.00 4.00 75482.00 ▇▁▁▁▁
gwnoa 71121 0.78 611.96 156.45 2.00 560.00 652.00 700.00 910.00 ▁▁▂▇▂
gwnob 310668 0.02 457.37 162.44 2.00 369.00 369.00 369.00 800.00 ▁▁▇▁▂

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date_start 0 1 1989-01-01 2022-12-31 2013-09-20 12410
date_end 0 1 1989-01-01 2022-12-31 2013-09-20 12412

group_by() allows the calculation of all statistics by group. For example, if we wanted counts by region:

ged %>% 
  group_by(region) %>%
  count()

More cleaning - factors

As is common in social science data, this data has a lot of categorical (factor) variables stored as numeric. For example, we have a five level date precision variable around the recorded date of the event in the data:

ged %>%
  dplyr::select(date_prec) %>%
  group_by(date_prec) %>%
  count()

If you look in the codebook, these values correspond with an estimate of the uncertainty, with 1 being exact and 5 being sometime more than a month but less than a year. We can convert this with the base as.factor(), but tidyverse has some useful functions out of the forcats package for categorical variables, like fct_recode() which recodes the levels in the data:

ged<-ged %>% 
  mutate(date_prec = fct_recode(as.factor(date_prec), 
                                "Exact" = "1",
                                "2-6 days" = "2",
                                "Week" = "3",
                                "Month" = "4",
                                "More than 1 Month, Less than 1 Year" = "5"))

ged %>%
  group_by(date_prec) %>%
  count()

Note that the syntax here is “new name” = “old name,” which is standard across tidyverse functions.

Plotting

ggplot2 is the tidyverse plotting package.

ged %>%
  ggplot(aes(y=best, x=date_end))

This hasn’t done anything - why not? ggplot treats graphics as consisting of two parts: aesthetic mappings and geometric objects. Unlike base R, where different functions imply different plots (plot() vs hist()), all ggplot plots start from this same aesthetic. All we’ve done is put in the data - this just tells ggplot what variables we are interested in. If we want to create a graphic, we need to tell it what geometries we want.

ged %>%
  ggplot(aes(y=best, x=date_end)) +
  geom_point()

geom_point() is a scatter plot. Notice we’ve switched to + to move between lines in ggplot rather than a pipe (%>%). ggplot results aren’t piped to each other; layers are added to aesthetics.

This is still not great - there’s a lot of overlap on the points, outliers are really making the visual hard to see, and it’s not very neat.

ged %>%
  ggplot(aes(y=log(best+1), x=date_end)) +
  geom_point(alpha=0.2)+
  theme_bw()

We’ve done a couple things here - first we’ve log transformed the data (not ideal for a count but illustrative). The nice thing here is you can apply the transforms within the data - you don’t need to create a new variable within the dataframe or replace the existing one. Second we’ve used the alpha parameter to control the opacity of the dots to make it easier to see where they overlap.

Still a lot of data - maybe we want to aggregate in some way

ged %>%
  group_by(year=year(date_end)) %>%
  summarize(sum=sum(best)/10^3) %>%
  ggplot(aes(y=sum, x=year)) +
  geom_line()+
  labs(x="Year", y="Fatalities (Thousands)")+
  theme_bw()

Here we’ve created an intermediate step with data summed over year:

ged %>%
  group_by(year=year(date_end)) %>%
  summarize(sum=sum(best)/10^3)

Year cutpoints are a bit rough - is December 31, 2023 more similar to January 1, 2024 or January 1, 2023? We could smooth the data instead. Default is a GAM with smoothing splines, but you can do others. Note: loess compute time increases quickly in sample, so will be slow.

ged %>%
  ggplot(aes(y=best, x=date_end)) +
  geom_smooth()+
  theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ged %>%
  ggplot(aes(y=best, x=date_end)) +
  geom_smooth(method=c("lm"))+
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Faceting

Faceting applies the same procedure over subgroups of the data defined by a variable.

ged %>%
  ggplot(aes(y=best, x=date_end)) +
  geom_smooth(method=c("lm"))+
  facet_grid(~region)+
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Bar Plot

Here’s a basic bar plot

ged %>%
  ggplot(aes(date_prec))+
  geom_bar()

If we wanted to see by region we could pass that as the fill variable. scale_fill_viridis() changes the color pallet. The default is different colors that do not exhibit significant variation in hue/saturation. As a result they can be difficult for readers with colorblindness or journals that still print in greyscale. Note that the default position is stacked. This is ok, but the smaller categories are hard to read.

ged %>%
  ggplot(aes(date_prec, fill=region))+
  geom_bar()+
  scale_fill_viridis_d()+
  theme_bw()

To switch from stacked change the position option:

ged %>%
  ggplot(aes(date_prec, fill=region))+
  geom_bar(position="dodge")+
  scale_fill_viridis_d()+
  theme_bw()

If we facet this by precision we can use free scales to see things better. The risk of free scales is that a quick reader will assume they’re the same. Here, we might think most of Africa’s events have precision worse than 2-6 days, but in reality almost 88% are Exact or 2-6 days. Comparison is only valid within the facet - e.g., Africa has more events at week, month, and greater precision than other regions.

ged %>%
  ggplot(aes(region, fill=region))+
  geom_bar(position="dodge")+
  scale_fill_viridis_d()+
  facet_wrap(~date_prec, scales="free")+
  theme_bw()

Getting there, but we have some bad labels and an unecessary legend!

final_bar<-ged %>%
  ggplot(aes(region, fill=region))+
  geom_bar(position="dodge")+
  scale_fill_viridis_d(guide="none")+
  labs(x="Region", y="Number of Events")+
  facet_wrap(~date_prec, nrow=3,
             scales="free", labeller = labeller(date_prec = label_wrap_gen(25)))+
  scale_x_discrete(labels = label_wrap(10)) +
  theme_bw()

final_bar

And we might want to save this one

ggsave(final_bar, file="barplot.pdf", width = 6, height =6, units="in")

The precision variable is ordinal - meaning the ordering of the values has meaning. Reording precision doesn’t make sense except maybe to reverse it so higher values indicate ‘higher’ precision. For nominal variables forcats + ggplot allows some nice reordering:

ged %>% 
  ggplot(aes(region))+
  geom_bar()+
  coord_flip()+
  theme_bw()

For region, there’s no real meaning - Europe is not intrinsically closer to Asia than Africa. So why not rearrange to make things prettier?

ged %>% 
  ggplot(aes(fct_rev(fct_infreq(region))))+
  geom_bar()+
  coord_flip()+
  theme_bw()

Last thing about geom_bar() - the default is a count of observations (stat=“count”), so it only accepts one option via aes() either x or y. You can override this using the function option, but it’s a bit clunky. My preference is to create the stat you want to plot prior and then pipe it into ggplot. This gives you more control and you only need to remember one option stat=“identity” which plots whatever you’ve supplied as y.

Let’s say you want to check for monthly seasonality across regions, but you only want the data where we are certain of the date. First, you filter based on date_prec, then group by the month and region. Next you use summarize to create a mean variable. That variable - avg - gets passed to ggplot as y. Finally, you change the stat for geom_bar to “identity.” Note that if you don’t change the stat it’ll error out.

ged %>%
  filter(date_prec=="Exact") %>%
  group_by(month = lubridate::month(date_end, label=T), region) %>%
  summarize(avg=mean(best)) %>%
  ggplot(aes(y=avg, x=month))+
  geom_bar(stat = "identity")+
  facet_wrap(~region, scales = "free", nrow=3)+
  theme_bw()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

Here again it might be useful to see what the data looks like before we pipe to ggplot:

ged %>%
  filter(date_prec=="Exact") %>%
  group_by(month = lubridate::month(date_end, label=T), region) %>%
  summarize(avg=mean(best))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

Histograms

Here’s a basic histogram. Note within the histogram function we’ve applied a calculation for optimal binwidth. If not ggplot picks 30 bins and complains.

hist<-ged %>%
  ggplot(aes(date_start))+
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)))+
  theme_bw()

hist

Unlike the line plots, the histogram has performed an intermediate calculation that’s hidden, but can be extracted

layer_data(hist)

This is a bit confusing because our times are stored as POSIX which is harder to read as it’s a count of seconds since the beginning of 1970. We can convert back:

layer_data(hist) %>%
  mutate(xmin = as_datetime(xmin),
         xmax = as_datetime(xmax))

This is better - we can see that the first bin in our histogram has edges at October 31, 1988 and March 11, 1989 and contained 496 events.

We can also facet this plot. Note that the binwidth calculation is redone for each facet, so the binwidths are optimal within the group.

ged %>%
  ggplot(aes(date_start))+
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)))+
  theme_bw()+
  facet_wrap(~region)

There are a lot of packages that aren’t part of the tidyverse but built to integrate with it. For example, the plotly R package can convert your ggplot figures into interactive plotly figures.

ggplotly(hist)
ggplotly(final_bar, width=800, height=800)